
Conversation

@lhoestq
Member

@lhoestq lhoestq commented Dec 17, 2024

  • link to blog post for real world use case
  • explain how dask parallelism works
  • recommend squash history after writing to HF
  • add a distributed data processing example
  • explain predicate and projection pushdown + example

@lhoestq lhoestq requested a review from davanstrien December 17, 2024 14:53
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@davanstrien davanstrien left a comment


Looks great! If you don't get a chance, I can also add something similar for Polars. I had one question about how Dask does the filtering for downloads. I may have misunderstood something here.


# Dask will skip the files or row groups that don't
# match the query without downloading them.
df = df[df.dump >= "CC-MAIN-2023"]
Member


Does Dask not still need to download the data to check that the values in this column match this filter? From what I understood, in the Polars case predicate pushdown is usually used for skipping the reading of a column, i.e. if you drop it later it doesn't bother to load it, and/or for doing a filtering step early on. Is Dask able to do this directly, before loading?

Member Author

@lhoestq lhoestq Dec 17, 2024


So it skips the row groups that don't have any rows matching the query, using the row-group metadata.

Then, on the remaining row groups, it downloads only the column used for filtering in order to apply the filter.

The other columns are downloaded or not depending on the other operations performed on the dataset.
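To make the first step concrete: a hypothetical sketch (not Dask's actual internals) of how row-group metadata enables skipping. Each Parquet row group stores min/max statistics per column, so a filter like `dump >= "CC-MAIN-2023"` can rule out whole row groups before any data is downloaded. The `row_groups` values and the `may_match` helper below are illustrative assumptions.

```python
# Toy stand-in for Parquet row-group metadata: min/max stats per column.
row_groups = [
    {"dump_min": "CC-MAIN-2021", "dump_max": "CC-MAIN-2021"},
    {"dump_min": "CC-MAIN-2022", "dump_max": "CC-MAIN-2023"},
    {"dump_min": "CC-MAIN-2023", "dump_max": "CC-MAIN-2024"},
]

def may_match(rg, lower_bound):
    # A row group can only contain matching rows for `dump >= lower_bound`
    # if its max value reaches the lower bound of the filter.
    return rg["dump_max"] >= lower_bound

# Only these row groups need to be downloaded at all; the rest are
# skipped using metadata alone.
to_download = [i for i, rg in enumerate(row_groups)
               if may_match(rg, "CC-MAIN-2023")]
print(to_download)  # → [1, 2] (row group 0 is skipped entirely)
```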

Member Author


if you drop it later it doesn't bother to load it and/or doing a filtering step early on. Is Dask directly able to do this before loading?

Yes, correct! It would be cool to explain that here as well.
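The projection-pushdown idea being confirmed here can be sketched as follows. This is not Dask's query planner, just a toy model under assumed names: each downstream step declares the columns it uses, and the reader only fetches the union of those columns.

```python
# Columns available in the (hypothetical) Parquet dataset.
all_columns = ["url", "dump", "text", "language"]

# Toy "plan": each step records which columns it actually touches.
# A real engine derives this from the expression graph.
plan = [
    {"op": "filter", "uses": ["dump"]},   # df[df.dump >= "CC-MAIN-2023"]
    {"op": "select", "uses": ["text"]},   # df["text"]
]

# Projection pushdown: read only the columns some step needs,
# in the dataset's column order.
needed = {col for step in plan for col in step["uses"]}
columns_to_read = [c for c in all_columns if c in needed]
print(columns_to_read)  # → ['dump', 'text']
```

Columns like `url` and `language` are never read, which is why dropping a column early (or never selecting it) saves download time.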

Member


That's super cool!! For some datasets the download time does seem to end up becoming a blocker so this is very neat!

Co-authored-by: Daniel van Strien <[email protected]>
@lhoestq
Member Author

lhoestq commented Dec 17, 2024

Thanks for the review! Merging this one for now, but let me know if you have more comments.

@lhoestq lhoestq merged commit 0ea864e into main Dec 17, 2024
2 checks passed
@lhoestq lhoestq deleted the update-dask-docs branch December 17, 2024 17:52


4 participants